Red Wine Data Exploration by Shivam Bhardwaj

Which properties contribute to making a good red wine? In this project we try to answer this question by exploring the red wine data set.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.900  
##  1st Qu.: 7.100   1st Qu.:0.3950   1st Qu.:0.0900   1st Qu.:1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median :2.200  
##  Mean   : 8.259   Mean   :0.5288   Mean   :0.2661   Mean   :2.409  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.:2.600  
##  Max.   :13.200   Max.   :1.5800   Max.   :1.0000   Max.   :8.300  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.25      
##  Median :0.07900   Median :13.00       Median : 37.00      
##  Mean   :0.08699   Mean   :15.17       Mean   : 44.52      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 60.00      
##  Max.   :0.61100   Max.   :46.00       Max.   :144.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9967   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.316   Mean   :0.6569   Mean   :10.43  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7275   3rd Qu.:11.10  
##  Max.   :1.0029   Max.   :4.010   Max.   :2.0000   Max.   :14.00  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 'data.frame':    1534 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Most of the quality ratings are either 5 or 6, with 5 being the most frequent. Quality is a discrete, ordered variable, but if we were to treat it as continuous, the mean would be 5.64 and the median would be 6. The highest rating was 8, and the lowest was 3. Additionally, total sulfur dioxide and free sulfur dioxide appeared to be discrete variables. This is likely due to rounding. I would also expect citric acid to be a subset of fixed acidity and potentially volatile acidity.

The top 1% of values were stripped from fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide, as they appeared to be large outliers.
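The code that did this trimming is not shown in this report, so the following is only a minimal sketch of one way to do it in R. The helper name `trim_top_percent` and the toy data frame are assumptions made for illustration, and the original analysis may have computed the cut-offs differently (for example one variable at a time).

```r
# Hypothetical helper: keep only the rows that are at or below the 99th
# percentile of every listed column. Cut-offs are computed on the full data.
trim_top_percent <- function(df, cols, prob = 0.99) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    cutoff <- quantile(df[[col]], probs = prob, na.rm = TRUE)
    keep <- keep & !is.na(df[[col]]) & df[[col]] <= cutoff
  }
  df[keep, , drop = FALSE]
}

# Synthetic stand-in data (NOT the real wine measurements).
set.seed(1)
toy <- data.frame(
  fixed.acidity  = rlnorm(500, meanlog = 2,   sdlog = 0.2),
  residual.sugar = rlnorm(500, meanlog = 0.8, sdlog = 0.4)
)

trimmed <- trim_top_percent(toy, c("fixed.acidity", "residual.sugar"))
nrow(toy) - nrow(trimmed)  # number of rows dropped
```

Because a row is dropped if it is extreme in any of the listed columns, trimming four variables removes more than 1% of the rows overall, which is consistent with the data set shrinking from 1,599 to 1534 observations.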

Univariate Plots Section

This red wine data set originally contains 1,599 observations with 11 variables on the chemical properties of the wine, plus a quality rating.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Quality Distribution

The wine quality grade is a discrete number ranging from 3 to 8, with a median of 6.

Distribution of Other Chemical Properties

The long-tailed sulphates data were transformed to get a more symmetric distribution. The log10 transformation produces a roughly normal distribution, and nothing else about the transformations is particularly striking. Given that only 1534 observations are being analyzed, it is very likely that many possible sulphate values are simply not represented in the data set. The variance decreases for log10 sulphates and the histogram looks more normal, so I will keep the transformation.

It appears that we can actually group wine quality into three distinct categories: bad, average, and excellent. Most of the red wines were average, followed by excellent, and then bad. It seems like the red wines overall were very average, with a few having excellent tastes. I’m interested in what makes a wine excellent or bad – not what makes it average.

##       bad   average excellent 
##        62      1264       208
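A grouping like this can be built with `cut()`. The sketch below assumes the cut points 3-4 for bad, 5-6 for average, and 7-8 for excellent, and uses a handful of made-up scores rather than the real data.

```r
# Made-up quality scores for illustration only
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Right-closed intervals: (2,4] -> bad, (4,6] -> average, (6,8] -> excellent
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("bad", "average", "excellent"),
              ordered_result = TRUE)

table(rating)
```

Making the factor ordered keeps bad, average, and excellent in a meaningful order in later tables and box plots.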

Univariate Analysis

Some observations on the distributions of the chemical properties can be made:

Rescale Variable

Skewed, long-tailed data can be transformed toward a more normal distribution by taking the square root or the logarithm. Taking sulphates as an example, we compare the original, square-root, and log-transformed versions of the feature.
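As a rough numeric companion to the histograms, the sketch below compares the sample skewness of the three versions. The `skewness` helper is written here for illustration (it is not a base R function), and the sulphates values are synthetic right-skewed data, not the real measurements.

```r
# Sample skewness (third standardized moment); close to 0 for symmetric data.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Synthetic, right-skewed stand-in for the sulphates column
set.seed(42)
sulphates <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.25)

skews <- sapply(
  list(raw = sulphates, sqrt = sqrt(sulphates), log10 = log10(sulphates)),
  skewness
)
round(skews, 3)
```

A skewness closer to zero after the transformation supports keeping it, in the same way that a more bell-shaped histogram does.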

What is the structure of your dataset?

There are 1534 observations after slicing out the top 1% of values from the variables that had large outliers (fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide).

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature. I want to determine what makes a wine taste good or bad.

What other features in the dataset do you think will help support your analysis?

Did you create any new variables from existing variables in the dataset?

Yes, I created a rating variable derived from quality, with three distinct categories: (bad: 3,4), (average: 5,6), (excellent: 7,8).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • The top 1% of values were stripped from fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide.
  • The x column was removed, as it was simply an index with no analytical value.
  • Sulphates appeared to be skewed and were log-transformed, which revealed an approximately normal distribution.

Bivariate Plots Section

A plot matrix was used to get a first glance at the data. We are interested in the correlation between wine quality and each chemical property.

The top 4 factors correlated with wine quality (with a correlation coefficient greater than 0.2 in absolute value) are:

Property           r-value
alcohol             0.49
volatile.acidity   -0.39
sulphates           0.256
citric.acid         0.223
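A ranking like the one above can be produced by correlating every predictor with quality and sorting by absolute value. The sketch below uses a small synthetic data frame whose column names mirror the wine data; the values and the coefficients used to simulate quality are invented for illustration.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(7)
n <- 200
alcohol <- rnorm(n, 10.4, 1)
volatile.acidity <- rnorm(n, 0.53, 0.18)
toy <- data.frame(
  alcohol = alcohol,
  volatile.acidity = volatile.acidity,
  pH = rnorm(n, 3.3, 0.15),
  quality = round(5.6 + 0.4 * (alcohol - 10.4) -
                    1.5 * (volatile.acidity - 0.53) + rnorm(n, 0, 0.6))
)

# Pearson correlation of every predictor with quality
predictors <- setdiff(names(toy), "quality")
r <- sapply(toy[predictors], cor, y = toy$quality)

# Sort by absolute strength and keep those above the 0.2 cut-off
r_sorted <- r[order(abs(r), decreasing = TRUE)]
round(r_sorted[abs(r_sorted) > 0.2], 3)
```

Filtering on the absolute value matters here, since a strong negative correlation such as the one for volatile acidity is just as informative as a positive one.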

From the above table and plot matrix we see that “fixed.acidity”, “volatile.acidity” and “pH” have some correlation with “citric.acid”. Interestingly, “density” has some correlation with “fixed.acidity” and “alcohol”. Also, “quality” has some correlation with “alcohol”.

To see if the data makes sense chemically, I first plot pH and fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to drop as fixed acidity increases, which makes sense.

## [1] -0.6794406

The correlation between citric acid and pH is slightly weaker, at -0.53. This adds up, as citric acid is a subset of fixed acidity.

## [1] -0.5283267

Volatile acidity (acetic acid) seems to increase when pH level increases. The correlation coefficient was 0.24 indicating some positive correlation.

## [1] 0.2387919

I want to further explore alcohol, pH, volatile acidity, citric acid, and sulphates and see how they relate to the quality of the wine. Alcohol, volatile acidity, sulphates, and citric acid all had correlation coefficients with quality above 0.2 in absolute value; pH is included for comparison. Box plots are used, with the median as a robust measure of the center of each group. As expected, the medians follow the same pattern as the correlation coefficients. The box plots reveal an extremely interesting fact about alcohol: alcohol content is markedly higher for excellent wines than for bad or average wines. Sulphates and citric acid also seem to be positively correlated with quality, and volatile acidity appears to be negatively correlated.

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.302   3.380   3.385   3.500   3.900 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.210   3.310   3.315   3.402   4.010 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.280   3.295   3.380   3.780

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.20   10.98   13.10 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.00   10.26   10.90   14.00 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   10.80   11.60   11.54   12.22   14.00

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0750  0.1713  0.2675  1.0000 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2538  0.4000  0.7600 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.3950  0.3687  0.4900  0.7600

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4925  0.5600  0.5927  0.6000  2.0000 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6457  0.7000  1.9800 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7444  0.8200  1.3600
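Grouped summaries in this layout come from `by()`. As a minimal sketch with made-up alcohol values (not the real data), the same call and a more compact `tapply()` variant for the medians look like this:

```r
# Made-up values for illustration only
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.5),
  rating  = factor(c("bad", "bad", "average", "average", "average",
                     "excellent", "excellent"),
                   levels = c("bad", "average", "excellent"))
)

# Full five-number summary (plus mean) for each rating group
by(toy$alcohol, toy$rating, summary)

# Just the medians, one per group
medians <- tapply(toy$alcohol, toy$rating, median)
medians
```

The `tapply()` form is handy when only one statistic per group is needed, for example to compare medians across the five variables side by side.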

None of the other variables share much in common with alcohol: the highest correlation is with pH, at 0.22. However, alcohol and quality have a correlation coefficient of 0.49, which may be leading me somewhere.

It appears that when citric acid is in higher amounts, sulphates are as well. The freshness from the citric acid and the antimicrobial effects of the sulphates are likely correlated. The correlation coefficient was 0.33 which indicates weak correlation, but still noteworthy.

## [1] 0.3302825

When graphing volatile acidity and citric acid, there is clearly a negative correlation between the two. It seems that fresher wines tend to avoid the use of acetic acid. The correlation coefficient was -0.56, indicating that larger amounts of citric acid meant smaller amounts of volatile acidity. Since volatile acidity is essentially acetic acid, the wine makers would likely not put a large amount of two acids in the wine, leading them to choose one or the other.

## [1] -0.5629224

There is no particularly striking relationship between alcohol and pH – a weak positive correlation of 0.22.

## [1] 0.2166557

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It appears that when citric acid is in higher amounts, sulphates are as well. The freshness from the citric acid and the antimicrobial effects of the sulphates are likely correlated. Volatile acidity and citric acid are negatively correlated. It is likely that fresher wines avoid the bitter taste of acetic acid. Citric acid and pH were also negatively correlated – a lower pH indicates a higher acidity. pH and alcohol are very weakly correlated. Pure alcohol (100%) has a pH of 7.33, so when it is diluted it will likely increase the pH level ever so slightly.

The boxplots reveal an interesting picture as well:

  • The median for sulphates increased with each quality category. The biggest jump was from average to excellent, with a median of approximately 0.74 for excellent and 0.61 for average.
  • Citric acid had the highest concentration in excellent wines. The median rose steadily across the quality categories, with medians of 0.075 for bad, 0.24 for average, and 0.395 for excellent.
  • As volatile acidity increased, the quality of the wine became worse, with medians of 0.68 for bad, 0.54 for average, and 0.37 for excellent. It’s possible that past a certain threshold, the acetic acid became too bitter for the tasters.
  • The median alcohol content (10%) was the same whether the wine was bad or average. However, for the excellent wines, the median alcohol content was 11.6%. This leads to a striking observation: a higher alcohol content may lift a wine from average to excellent, but other factors are at play in making a wine taste bad.
  • pH didn’t change much between the wines, with medians of 3.38 for bad, 3.31 for average, and 3.28 for excellent.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity and citric acid were negatively correlated, as were citric acid and pH. Fixed acidity and pH were negatively correlated, since a lower pH means a more acidic wine.

What was the strongest relationship you found?

From the variables analyzed, the strongest relationship was between fixed acidity and pH, which had a correlation coefficient of -0.679. The next strongest was between citric acid and volatile acidity, with a correlation coefficient of -0.563.

Multivariate Plots Section

Main Chemical Property vs Wine Quality

With different colors, we can add another dimension to the plot. There are 4 main features. Alcohol and volatile acidity are the top two factors that affect wine quality.

The figure looks overplotted, since the wine quality grades are discrete numbers. We can use a jitter plot to alleviate this problem.
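The idea behind jittering is to add a small amount of random noise to the discrete variable so that points no longer stack on top of each other. The sketch below uses base R graphics and synthetic data; the plotting code used for the original figures is not shown in this report and may well have used a different library.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(3)
quality <- sample(3:8, 300, replace = TRUE)   # integer grades
alcohol <- rnorm(300, mean = 10.4, sd = 1)

# jitter() adds uniform noise; here at most 0.3 in either direction,
# so every point still rounds back to its original grade
quality_jittered <- jitter(quality, amount = 0.3)

# Semi-transparent points make dense regions visible
plot(alcohol, quality_jittered,
     pch = 16, col = rgb(0, 0, 0, alpha = 0.3),
     xlab = "alcohol", ylab = "quality (jittered)")
```

Keeping the noise below 0.5 means the horizontal bands for neighbouring grades never overlap, so the grade of each point remains readable.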

We can see that higher quality wines have higher alcohol and lower volatile acidity.

Add Another Feature

Now we add a third feature, the log scale of sulphates, and use a separate facet for each wine grade.

We can see that higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and higher sulphates (hue).

Main Chemical Properties vs Wine Quality

Since we can visualize only 3 dimensions at a time, including wine quality, two graphs are needed to show the 4 main chemical properties.

The same trend in the effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).

Linear Multivariable Model

A multivariable linear model was created to predict wine quality from the chemical properties.

The features are added incrementally, based on how strongly each one correlates with wine quality.

## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates, 
##     data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide + density, 
##     data = redwine)
## 
## ======================================================================================================
##                            m1         m2         m3         m4         m5         m6          m7      
## ------------------------------------------------------------------------------------------------------
##   (Intercept)            6.567***   2.942***   2.451***   2.492***   2.608***   2.875***   -3.001     
##                         (0.060)    (0.190)    (0.201)    (0.207)    (0.208)    (0.214)    (13.026)    
##   volatile.acidity      -1.761***  -1.361***  -1.193***  -1.242***  -1.132***  -1.082***   -1.097***  
##                         (0.108)    (0.098)    (0.100)    (0.117)    (0.120)    (0.119)     (0.124)    
##   alcohol                           0.327***   0.322***   0.322***   0.305***   0.286***    0.291***  
##                                    (0.016)    (0.016)    (0.016)    (0.017)    (0.017)     (0.021)    
##   sulphates                                    0.694***   0.711***   0.886***   0.943***    0.935***  
##                                               (0.102)    (0.105)    (0.113)    (0.113)     (0.114)    
##   citric.acid                                            -0.087      0.017      0.052       0.022     
##                                                          (0.108)    (0.111)    (0.110)     (0.128)    
##   chlorides                                                         -1.632***  -1.778***   -1.752***  
##                                                                     (0.412)    (0.410)     (0.414)    
##   total.sulfur.dioxide                                                         -0.003***   -0.003***  
##                                                                                (0.001)     (0.001)    
##   density                                                                                   5.854     
##                                                                                           (12.976)    
## ------------------------------------------------------------------------------------------------------
##   R-squared                  0.1        0.3        0.3        0.3        0.4        0.4        0.4    
##   adj. R-squared             0.1        0.3        0.3        0.3        0.3        0.4        0.4    
##   sigma                      0.7        0.7        0.7        0.7        0.7        0.6        0.6    
##   F                        267.7      366.4      266.7      200.2      164.8      143.2      122.7    
##   p                          0.0        0.0        0.0        0.0        0.0        0.0        0.0    
##   Log-likelihood         -1730.3    -1553.8    -1531.2    -1530.8    -1523.0    -1511.5    -1511.4    
##   Deviance                 857.2      681.0      661.2      660.9      654.2      644.5      644.4    
##   AIC                     3466.6     3115.6     3072.3     3073.7     3060.0     3039.0     3040.8    
##   BIC                     3482.6     3137.0     3099.0     3105.7     3097.4     3081.7     3088.8    
##   N                       1534       1534       1534       1534       1534       1534       1534      
## ======================================================================================================
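The nested models m1 to m7 follow a simple pattern: each one adds a single term to the previous formula. A minimal base R sketch of that pattern is shown below, using synthetic data with invented coefficients and only three predictors. The side-by-side table above comes from a formatting helper that is not reproduced here.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(11)
n <- 300
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  sulphates        = rnorm(n, 0.65, 0.15)
)
toy$quality <- 5.6 + 0.3 * (toy$alcohol - 10.4) -
  1.2 * (toy$volatile.acidity - 0.53) +
  0.8 * (toy$sulphates - 0.65) + rnorm(n, 0, 0.6)

# Each model adds one term to the previous one
m1 <- lm(quality ~ volatile.acidity, data = toy)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + sulphates)

# Compare fit criteria across the nested models
models <- list(m1 = m1, m2 = m2, m3 = m3)
data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
```

Criteria such as AIC and BIC penalize extra terms, which is useful for deciding when to stop. In the table above, for instance, AIC rises from 3039.0 for m6 to 3040.8 for m7, suggesting that adding density does not improve the model.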

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Based on the multivariate analysis, four features stood out to me in relation to quality: alcohol, sulphates, citric acid, and volatile acidity. Throughout my analysis, chlorides and residual sugar led to dead ends. However, high volatile acidity and low sulphates were a strong indicator of a bad wine. High alcohol content, low volatile acidity, higher citric acid, and higher sulphates all made for a good wine.

Were there any interesting or surprising interactions between features?

Surprisingly, other chemical properties, such as residual sugar and pH, do not have a strong correlation with wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model using seven variables: alcohol, citric acid, sulphates, volatile acidity, chlorides, total.sulfur.dioxide and density. The model was less precise in predicting qualities of 3, 4, 7, and 8, where the error was +/- 2. For qualities of 5 and 6, the majority of predictions were off by 0.5 and 1 for each bound. With an R-squared of about 0.4, the model explains less than half of the variance in quality. The limitations of this model are obvious: I’m trying to use a linear model for data that clearly isn’t perfectly linear.
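One way to check how the error varies by grade is to average the absolute residuals within each observed quality level. The sketch below does this for a single-predictor model on synthetic data, so the numbers it prints say nothing about the real wines; it only illustrates the bookkeeping.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(5)
n <- 400
alcohol <- rnorm(n, 10.4, 1)
quality <- pmin(8, pmax(3, round(5.6 + 0.5 * (alcohol - 10.4) + rnorm(n, 0, 0.7))))
toy <- data.frame(alcohol, quality)

fit <- lm(quality ~ alcohol, data = toy)
abs_error <- abs(residuals(fit))

# Mean absolute error for each observed quality grade
mae_by_grade <- tapply(abs_error, toy$quality, mean)
round(mae_by_grade, 2)
```

A linear fit pulls its predictions toward the most common grades, so larger errors at the extremes are what one would expect from a data set dominated by 5s and 6s.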


Final Plots and Summary

Plot One: Distribution of Wine Quality

Description One

The simplest but most informative plot shows the distribution of wine quality. Note that the data set is unbalanced, although the distribution is roughly normal. It has many observations for medium quality (grades 5 and 6), but far fewer for low (grades 3 and 4) and high (grades 7 and 8) quality wine.

Plot Two: Alcohol & Sulphates vs. Quality & Volatile Acidity vs Quality

Description Two

The 4 features are also represented in the scatter plot. Two features are plotted at a time, with color indicating wine quality. A similar trend to the last figure can be observed. In general, high quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphate and higher citric acid content. As we can see, when volatile acidity is greater than 1, the probability of the wine being excellent is zero. When volatile acidity is either 0 or 0.3, there is roughly a 40% probability that the wine is excellent. However, when volatile acidity is between 1 and 1.2, there is an 80% chance that the wine is bad. Moreover, any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad. Therefore, volatile acidity is a good predictor of bad wines.

Observe that lower sulphates content typically leads to a bad wine. Average wines have higher concentrations of sulphates. For citric acid, however, the share of excellent wines increases as the concentration increases.
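Shares like the ones quoted above for volatile acidity can be tabulated by binning the variable and computing row-wise proportions of each rating. The sketch below uses synthetic data generated by an invented rule, so its output is illustrative only and does not reproduce the percentages in this report.

```r
# Synthetic stand-in data (NOT the real wine measurements)
set.seed(9)
n <- 500
volatile.acidity <- runif(n, 0.1, 1.6)

# Invented rule: higher volatile acidity makes "bad" more likely
p_bad <- pmin(1, volatile.acidity / 1.6)
rating <- ifelse(runif(n) < p_bad, "bad",
                 ifelse(runif(n) < 0.5, "average", "excellent"))
rating <- factor(rating, levels = c("bad", "average", "excellent"))

# Bin volatile acidity, then compute the share of each rating per bin
bins <- cut(volatile.acidity, breaks = seq(0, 1.6, by = 0.4))
shares <- prop.table(table(bins, rating), margin = 1)  # each row sums to 1
round(shares, 2)
```

Reading across a row gives the estimated probability of each rating for wines in that volatile acidity range. Bins with very few wines give unstable estimates, which is worth keeping in mind for the extreme values.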

Plot Three: Boxplotting Main Features

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.20   10.98   13.10 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.00   10.26   10.90   14.00 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   10.80   11.60   11.54   12.22   14.00

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0750  0.1713  0.2675  1.0000 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2538  0.4000  0.7600 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.3950  0.3687  0.4900  0.7600

## redwine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4925  0.5600  0.5927  0.6000  2.0000 
## -------------------------------------------------------- 
## redwine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6457  0.7000  1.9800 
## -------------------------------------------------------- 
## redwine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7444  0.8200  1.3600

Description Three

This graph was interesting because it showed how excellent wines tended to have a higher alcohol content, all else equal. By this I mean certain preconditions had to exist for alcohol to be the predominant determinant of quality. The 4 features that have the highest correlation coefficients are alcohol, volatile acidity, sulphates, and citric acid. The wine quality is grouped into low (3,4), medium (5,6) and high (7,8). High quality wines have a high alcohol level; however, there is no significant difference between medium and low quality wines. Volatile acidity decreases as wine quality increases. Sulphates and citric acid increase as wine quality increases.


Reflection

This analysis was conducted with the aim of uncovering hidden insights by moving one step at a time, proceeding further or stepping back based on the outcome. At times it was hard to believe that a hypothesis was incorrect, but the result did make sense. What most influenced the direction of the analysis was the patterns that emerged along the way.

The biggest struggle in this process was working through the number of iterations needed to get the results out correctly, which in itself is a very tedious process. I felt like giving up at times, but instead I decided to work through it one step at a time.

In a future analysis, it would make sense to look at the free radicals.

The takeaways from this analysis are that wines of high quality tend to have a higher alcohol content and lower volatile acidity. Another interesting finding was that citric acid content increases as pH decreases. So, wines with a lower pH (higher acidity) have a higher citric acid content.

In conclusion, if you are looking for a good bottle of red wine, it will most likely have low volatile acidity and a good amount of alcohol.